Synchronous relaxation algorithm for parallel kinetic Monte Carlo 
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We investigate the applicability of the synchronous relaxation (SR) algorithm to parallel kinetic 
Monte Carlo simulations of simple models of thin-film growth. A variety of techniques for optimizing 
the parallel efficiency are also presented. We find that the parallel efficiency is determined by 
three main factors — the calculation overhead due to relaxation iterations to correct boundary 
events in neighboring processors, the (extreme) fluctuations in the number of events per cycle in 
each processor, and the overhead due to interprocessor communications. Due to the existence of 
fluctuations and the requirement of global synchronization, the SR algorithm does not scale, i.e. the 
parallel efficiency decreases logarithmically as the number of processors increases. The dependence 
of the parallel efficiency on simulation parameters such as the processor size, domain decomposition 
geometry, and the ratio D/F of the monomer hopping rate D to the deposition rate F is also 
discussed. 

PACS numbers: 81.15.Aa, 05.10.Ln, 05.10.-a, 89.20.Ff 



I. INTRODUCTION 
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Kinetic Monte Carlo (KMC) is an extremely efficient 



method 



1.2.3.4.5 



to carry out dynamical simulations when 



the relevant activated atomic-scale processes are known, 
and KMC simulations have been used to model a variety 
of dynamical processes ranging from catalysis to thin- 
film growth. The basic principle of kinetic Monte Carlo 
is that the probability that a given event will be the next 
event to occur is proportional to the rate for that event. 
Since all processes are assumed to be independent Pois- 
son processes, the time of the next event is determined 
by the total overall rate for all processes, and after each 
event the rates for all processes are updated as necessary. 

In contrast to Metropolis Monte Carlop in which 
each Monte Carlo step corresponds to a configuration- 
independent time interval and each event is selected 
randomly but only accepted with a configuration- 
dependent probability, in kinetic Monte Carlo both the 
selected event and the time interval between events are 
configuration-dependent while all attempts are accepted. 
In the context of traditional Monte Carlo simulations 
this is sometimes referred to as the n-fold way. 1 While 
KMC requires additional book-keeping to keep track of 
the rates of all possible events, the KMC algorithm is 
typically significantly more efficient than the Metropolis 
algorithm since no selected moves are rejected. In par- 
ticular, for problems such as thin-film growth in which 
the possible rates or probabilities for events can vary by 
several orders of magnitude, the kinetic Monte Carlo al- 
gorithm can be orders of magnitude more efficient than 
Metropolis Monte Carlo. 

Because the attempt time in Metropolis Monte Carlo is 
independent of system configuration, parallel Metropolis 
Monte Carlo simulations may be carried out by using an 
asynchronous "conservative" algorithm ,L2i2ii2iiI In such 
an algorithm all processors whose next attempt time is 
less than their neighbor's next attempt times are allowed 



to proceed. Unfortunately such a "conservative" algo- 
rithm does not work for kinetic Monte Carlo since in 
KMC the event-time depends on the system configura- 
tion. In particular, since fast events may "propagate" 
across processors, the time for an event already executed 
by a processor may change due to earlier events in nearby 
processors, thus leading to an incorrect evolution. As a 
result, the development of efficient parallel algorithms for 
kinetic Monte Carlo simulations remains a challenging 
problem. 

Lubachevsky has developed^ and Korniss et al have 
implemented^ a more efficient version of the conservative 
asynchronous algorithm for parallel dynamical Monte 
Carlo simulations of the spin-flip Ising model. The basic 
idea is to apply Metropolis dynamics to events on the 
boundary of a processor, but to accelerate interior moves 
by using the n-fold way. The choice of a boundary or 
interior move is determined by the ratio of the number 
of boundary sites to the sum of the acceptance prob- 
abilities for all interior moves. While all "n-fold way" 
interior moves are immediately accepted, all Metropo- 
lis attempts must wait until the neighboring processor's 
next attempt time is later before being either accepted 
or rejected. Since such an algorithm is equivalent to the 
conservative Metropolis Monte Carlo algorithm described 
above, it is generally scalablej2ii2iAi and has been found 
to be relatively efficient in the context of kinetic Ising 
model simulations in the metastable regimeiiSii&ii 

Recently we have showni£ that such an approach can 
be generalized in order to carry out parallel KMC simu- 
lations. In this approach, all possible KMC moves are first 
mapped to Metropolis moves with an acceptance prob- 
ability for each event given by the rate for that event 
divided by the fastest possible rate in the KMC simu- 
lation. At each stage, the choice of a boundary move 
versus an interior move is determined by the ratio of a 
fixed number corresponding to the sum of the rates for 
all possible events which might occur in the boundary 



region to the sum of the rates for all existing interior 
moves. However, because of the possibility of significant 
rejection of boundary events, the parallel efficiency of 
such an algorithm can be very low for problems with a 
wide range of rates for different processes. For exam- 
ple, we have recently^ used such a mapping to carry out 
parallel KMC simulations of a simple 2D solid-on-solid 
"fractal" model of submonolayer growth with a moder- 
ate ratio D/F = 10 5 of monomer hopping rate D to (per 
site) deposition rate F. However, due to the existence 
of significant rejection of boundary events, very low par- 
allel efficiencies were obtained. 15 Furthermore, in order 
to use such an approach, in general one needs to know 
in advance all the possible events and their rates and 
then to map them to Metropolis dynamics so that all 
events may be selected with the appropriate probabili- 
ties. While such a mapping may be carried out for the 
simplest models, for more complicated models it is likely 
to be prohibitive. 

In order to overcome these problems, we have recently 
proposed a semi-rigorous synchronous sublattice (SL) 
parallel algorithmic in which each processor's domain is 
further divided into sublattices in order to avoid a pos- 
sible conflict between processors. At the beginning of a 
cycle, one sublattice is randomly selected so that all pro- 
cessors operate on the same sublattice. Each processor 
then carries out KMC events for the selected sublattice 
over a time interval which is typically smaller than the 
inverse of the fastest single-event rate. At the end of each 
cycle, each processor communicates with its neighboring 
processors in order to update its boundary region. By 
carrying out extensive simulations of simple models of 
thin-film growth we have found that this algorithm loads 
to a relatively high parallel efficiency and is scalable, i.e. 
the parallel efficiency is constant as a function of the 
number of processors N p . However, for extremely small 
processor sizes (smaller than a typical diffusion length 
in epitaxial growth) weak finite-size effects are observed. 
Thus for problems in which the diffusion length is large or 
very small processor sizes are required, it may be prefer- 
able to use a more rigorous algorithm. 

In this paper we discuss the application of a sec- 
ond rigorous algorithm, the synchronous relaxation (SR) 
algorithmjiLiS to kinetic Monte Carlo simulations. This 
algorithm was originally used by Eick et alii to simulate 
large circuit-switched communication networks. More re- 
cently an estimate of its efficiency has been carried out by 
Lubachevsky and WeissiC in the context of Ising model 
simulations, In the SR algorithm, all processors remain 
globally synchronized at the beginning and end of a time 
interval, while an iterative relaxation method is used to 
correct errors due to neighboring processors' boundary 
events. Since this algorithm is rigorous, the cycle length 
can be tuned to optimize the parallel efficiency and sev- 
eral optimization methods are discussed. However, we 
find that the requirement of global synchronization leads 
to a logarithmic increase with increasing processor num- 
ber in both the relevant fluctuations in the number of 
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FIG. 1: Schematic diagram of (a) square and (b) strip decom- 
positions. Solid lines correspond to processor domains while 
dashed lines indicate "ghost-region" surrounding central pro- 
cessor. 



events per processor as well as the global communica- 
tion time. Accordingly, the SR algorithm does not scale 
since the parallel efficiency decreases logarithmically as 
the number of processors increases. As a result, the par- 
allel efficiency is generally significantly smaller than for 
the SL algorithm. 

The organization of this paper is as follows. In Sec- 
tion II we describe the algorithm and discuss several 
different methods of optimization. In Section III we 
present results obtained using this algorithm for three 
different models of thin-film growth, along with a brief 
comparison with serial results. We then discuss the 
three key factors— number of additional iterations, fluctu- 
ations, and communications time— which determine the 
parallel efficiency of the SR algorithm. The dependence 
of the parallel efficiency on such parameters as the num- 
ber of processors as well as the cycle length, processor 
size, and ratio D/F of monomer hopping rate D to (per 
site) deposition rate F is also discussed. Finally, in Sec- 
tion IV we summarize our results. 



II. SYNCHRONOUS RELAXATION (SR) 
ALGORITHM 



As in previous work on the "conservative" asyn- 
chronous algorithm, 8-9 in the synchronous relaxation 
(SR) algorithm, different parts of the system are assigned 
to different processors via spatial decomposition. For the 
thin-film growth simulations considered here, one may 
consider two possible methods of spatial decomposition, a 
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FIG. 2: Diagram showing time evolution in the synchronous 
relaxation algorithm. Dashed lines correspond to events, 
while dashed line with an X corresponds to an event which is 
rejected since it exceeds the time interval T. Arrows indicate 
boundary events carried out by processors Pi and Pi. 



square decomposition and a strip decomposition as shown 
in Fig. ^ Since the square decomposition requires com- 
munications with 4 neighbors while the strip decomposi- 
tion only requires communications with 2 neighbors, we 
expect that the strip decomposition will have reduced 
communication overhead. Accordingly, all the results 
presented here are for the case of strip decomposition. 
However, as discussed in more detail later, there may be 
some cases where the square decomposition is preferable. 

In order to avoid communicating with processors be- 
yond the nearest-neighbors, the processor size must be 
larger than the range of interaction (typically only a few 
lattice units in simulations of thin- film growth). In ad- 
dition, in order for each processor to calculate its event 
rates, the configuration in neighboring processors must 
be known as far as the range of interaction. As a result, 
in addition to containing the configuration information 
for its own domain, each processor's array also contains 
a "ghost-region" which includes the relevant information 
about the neighboring processor's configuration beyond 
the processor's boundary. 

We now describe the SR algorithm in detail. At the 
beginning of each cycle corresponding to a time inter- 
val T, each processor initializes its time to zero. A first 
iteration is then performed in which each processor car- 
ries out KMC events until the time of the next event 
exceeds the time interval T as shown in Fig. [21 As in 
the usual serial KMC, each event is carried out with time 
increment Aij = — \n{ri)/Ri where Ti is a uniform ran- 
dom number between and 1 and Ri is the total KMC 
event rate. At the end of each iteration, each processor 
communicates any boundary events with its neighbor- 
ing processors, i.e. any events which are in the range of 
interaction of a neighboring processor and which could 
thus potentially affect the neighboring processor's event 
rates and times. Here we define an event as consisting 
of the lattice sites which have changed due to the event 
along with the unique time U of the event. If at the end 
of a given iteration, a processor has received any new 
or missing boundary events (i.e. any boundary events 
different from those received in the previous iteration) 



then that processor must "undo" all of the KMC events 
which occurred after the time of the earliest new or miss- 
ing boundary event, and then perform another iteration 
starting at that point using the new boundary informa- 
tion received. However, if no processors have received 
new or missing boundary events, then the iterative re- 
laxation is complete, and all processors move on to the 
next cycle. In order to check for this, a global communi- 
cations between all processors is performed at the end of 
each iteration. 

In order for the iteration process to converge, one must 
ensure that within the same cycle the same starting con- 
figuration always leads to the same event or transition. 
Since pseudorandom numbers are used in the KMC sim- 
ulations considered here, this requires keeping a list of all 
the random numbers used during that cycle, and back- 
tracking appropriately along the random number list as 
events are "undone" so that the same random numbers 
are used for the same configurations. In the ideal imple- 
mentation described above, events are only redone start- 
ing from the earliest new boundary event. However, in 
the KMC simulations carried out here, lists were used 
to efficiently select and keep track of all possible events. 
Since properly undoing such lists is somewhat complex, 
here we have used a slightly less efficient but simpler 
method in which every iteration was restarted at the be- 
ginning of the cycle. In this case, the necessary changes 
in the configuration and random numbers were "undone" 
back to the beginning of the cycle, while the state of the 
lists at the beginning of the cycle was restored. Since 
there is significant overhead associated with "undoing" 
each move, and since in every iteration except the first, 
one needs to "undo" on average only half of the events in 
the previous iteration we estimate such a simplification 
leads to at most a 25% reduction in the parallel efficiency. 

We now consider the general dependence of the paral- 
lel efficiency on the cycle time T. If the cycle time is too 
short then there will be a small number of events in each 
cycle and as a result there will be large fluctuations in the 
number of events in different processors. This leads to 
poor utilization, i.e. some processors may process events 
during a given cycle while others may have very few or no 
events. In addition, for a short cycle time the communi- 
cation latency may become comparable to the calculation 
time which also leads to a reduction in the parallel effi- 
ciency. On the other hand, a very long cycle time will 
lead to a large number of boundary events in each cycle, 
and as a result the number of relaxation iterations will 
be large. Thus, in general the cycle length T must be 
optimized in order to balance out the competing effects 
of communication latency, fluctuations, and iterations in 
order to obtain the maximum possible efficiency. 

We have used three different methods to control the 
time interval T in order to optimize the parallel efficiency. 
In the first method, we have used a fixed cycle length 
(e.g. T = 4>/D where D is the monomer hopping rate) 
and then carried out simulations with different values 
of <j) in order to determine the optimal cycle length and 



maximize the parallel efficiency. In the second method, 
we have used feedback to dynamically control the cycle 
length T during a simulation. In particular, every 3—10 
cycles corresponding to a feedback interval, the elapsed 
execution time was either calculated or measured, and 
then used to calculate the ratio p (proportional to the 
parallel efficiency) of the average number of events per 
cycle n av to the execution time. Based on the values of 
p obtained during the previous two feedback intervals, 
the cycle length T was adjusted in order to maximize the 
parallel efficiency. In the third optimization method the 
cycle length was dynamically controlled in order to attain 
a pre-determined value for a target quantity such as the 
number of events per cycle (n op t) or the number of itera- 
tions per cycle (n^) whose optimal value was determined 
in advance. This method turned out be the most effec- 
tive since the parallel performance depends strongly on 
the number of iterations and/or the number of events per 
cycle. In contrast, while the parallel efficiency obtained 
using direct feedback was significantly better than that 
obtained using the first method, it was not quite as good 
as that obtained using the third method described above. 
As a result, here we focus mainly on the first and last 
methods. Unfortunately, both of these methods require 
the use of additional simulations in order to determine 
the optimal parameters. 



III. RESULTS 

In order to test the performance and accuracy of the 
synchronous relaxation algorithm we have used it to sim- 
ulate three specific models of thin-film growth. In partic- 
ular, we have studied three solid-on-solid (SOS) growth 
models on a square lattice: a "fractal" growth model, an 
edge-and-corner diffusion (EC) model, and a reversible 
model with one-bond detachment ("reversible model"). 
In each of these three models the lattice configuration 
is represented by a two-dimensional array of heights and 
periodic boundary conditions are assumed. In the "frac- 
tal" modelfiS- atoms (monomers) are deposited onto a 
square lattice with (per site) deposition rate F, diffuse 
(hop) to nearest-neighbor sites with hopping rate D and 
attach irreversibly to other monomers or clusters via a 
nearest- neighbor bond (critical island size of 1). The 
key parameter is the ratio D/F which is typically much 
larger than one in epitaxial growth. In this model frac- 
tal islands are formed in the submonolayer regime due 
to the absence of island relaxation. The EC model is 
the same as the fractal model except that island relax- 
ation is allowed, i.e. atoms which have formed a single 
nearest-neighbor bond with an island may diffuse along 
the edge of the island with diffusion rate D e = r e D and 
around island-corners with rate D c = r c D (see Fig. |3}- 
Finally, the reversible model is also similar to the frac- 
tal model except that atoms with one-bond (edge-atoms) 
may hop along the step-edge or away from the step with 
rate D\ = riD, thus allowing both edge-diffusion and 
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FIG. 3: Schematic diagram of island-relaxation mechanisms 
for (a) edge-and-corner and (b) reversible models. 



single-bond detachment. For atoms hopping up or down 
a step, an extra Ehrlich-Schwoebel barrier to interlayer 
diffusion^S may also be included. In this model, the crit- 
ical island size »2i can vary from i = 1 for small values of 
ri, to i = 3 for sufficiently large values of D/F and r\2& 
For the fractal and reversible models, the range of in- 
teraction corresponds to one nearest-neighbor (lattice) 
spacing, while for the EC model it corresponds to the 
next-nearest-neighbor distance. Thus, for these mod- 
els the width of the "ghost-region" corresponds to one 
lattice-spacing. We note that at each step of the sim- 
ulation, either a particle is deposited or a particle dif- 
fuses to a nearest-neighbor or next-nearest-neighbor lat- 
tice site. In more general models, for which concerted 
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the 



moves involving several atoms may occur, 

ghost region needs to be at least as large as the range of 

interaction and/or the largest possible concerted move. 

In such a case, the processor to which a concerted event 

belongs can be determined by considering the location of 

the center-of-mass of the atoms involved in the concerted 

move. 

In order to maximize both the serial and parallel ef- 
ficiency in our KMC simulations, we have used lists to 
keep track of all possible events of each type and rate. 
Each processor maintains a set of lists which contains all 
possible moves of each type. A binary tree is used to 
select which type of move will be carried out, while the 
particular move is then randomly chosen from the list of 
the selected type. After each move, the lists are updated. 



A. Computational Details 

In order to test our algorithm we have carried out both 
"serial emulations" as well as parallel simulations. How- 
ever, since our main goal is to test the performance and 
scaling behavior on parallel machines we have primar- 
ily focused on direct parallel simulations using the Ita- 
nium and AMD clusters at the Ohio Supercomputer Cen- 
ter (OSC) as well as on the Alpha cluster at the Pitts- 
burgh Supercomputer Center (PSC). All of these clus- 
ters have fast communications— the Itanium and AMD 
clusters have Myrinet and the AlphaServer cluster has 
Quadrics. In our simulations, the interprocessor commu- 
nications were carried out using MPI (Message-Passing 
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FIG. 4: Comparison between serial and parallel results for 
the fractal model with L = 256 and D/F = 10 5 . 
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B. Comparison with Serial Results 

We first present a brief comparison between serial and 
parallel results in order to verify the correctness of our 
implementation of the SR algorithm. Figure 0] shows the 
monomer and island densities as a function of coverage 
8 < 0.5 for the fractal model with D/F = 10 5 and system 
size L — 256 with parallel processor sizes N y = 256 and 
N x = 4,8,16 and 64 corresponding to N p = 64,32,16 
and 4 respectively (where N p is the number of proces- 
sors). As can be seen, there is excellent agreement be- 
tween the serial and parallel calculations even for N x = 4, 
the smallest processor size we have tested. Using the 
SR algorithm, we have also obtained excellent agreement 
between serial and parallel results for the monomer and 
island densities for the EC model (not shown). 



C. Calculation of parallel efficiency 

The parallel efficiency is determined by the competing 
effects of communication time, fluctuations, and number 
of relaxation iterations. In particular, the parallel exe- 
cution time tfq p {r) for N p processors in cycle r can be 
written as 



tN p \T~) — t ca l c (T) + t com + t ther j 



(1) 



where t ca i c {r) and t com denote the calculation and com- 
munication time respectively, while the last term t ot her 
includes all other timing costs not included in t ca i c such 
as sorting boundary events received from neighbors and 
comparing new boundary events with old ones to see if a 
new iteration is needed. If there are few boundary events 



(as is often the case), then t other may be ignored. The 
calculation time £ cq z c (t) in Eq. ^may be written as, 



tcalcir) = t X K MC X ( n av(T) + A(t)) 



(2) 



where t l KMC denotes the average serial calculation time 
per KMC event, n av (r) is the average number of actual 
events (averaged over all processors) per processor in cy- 
cle r, and A(t) corresponds to the additional number of 
events which must be processed due to fluctuations and 
relaxation iterations. 

Since all processors are synchronized after each itera- 
tion, in each iteration the total calculation time is deter- 
mined by the processor which has the maximum number 
of events to process. Therefore one may write, 

A ( T ) = E^max^i) + A ™max( r :J ~ 1)]) ~ ™<™(t) (3) 

where ^max( r 'i) ^ s the maximum (over all processors) 
number of new events in the jth iteration, / is the total 
number of relaxation iterations, and A ~ 0.14 is a fac- 
tor which reflects the reduced work to "undo" a KMC 
event as compared to executing a KMC event. Thus, the 
parallel efficiency (PE) can be approximated as 



PE = 



ip 



(tN p (r)) 



1 + 



(A(r)) , t c 



t 



ip 



(4) 



where t\ p = 



x t 



KMC 



is the time for a serial sim- 
ulation of a single processor's domain, n av — (n av (r)), 
and the brackets denote an average over all cycles. If the 
communication time is negligible compared to t\ v (i.e., 
tcom/ tip — > 0), the maximum possible parallel efficiency 
can be approximated as 



PE 



(Mr)) 



(5) 



We note that (A(r)) depends primarily on two quan- 
tities, the (extreme) fluctuations over all processors in 
the number of actual events in each cycle An e /n av = 
("max — n av )/n av , and the average number of iterations 
I per cycle. In particular, one may approximate the ad- 
ditional overhead due to iterations and fluctuations as, 



~ 77(7 - 1)(1 + ) 



(6) 



where a factor of rj with r\ < 1 has been included to take 
into account the fact that after the first iteration, the 
number of new events w max is typically less than n max . 
This result indicates that in the limit of negligible com- 
munications overhead, both the average number of iter- 
ations per cycle I and the relative fluctuations An e /n av 
should be small in order to maximize the parallel effi- 
ciency. We now consider the dependence of each of these 
quantities on the the cycle length T and the number of 
processors N p . 
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FIG. 5: Number of additional iterations as a function of 
average number of events per cycle n av with T = 1/D. For 
the EC model r e = 0.1 and r c — are used with w g — 1 and 2. 
Here N p = 4 and 9 = 1 ML for all cases and D/F = 10 3 - 10 7 . 



D. Number of iterations 

Figure shows the number of additional iterations be- 
yond the first iteration I' = I — 1 as a function of the 
average number of events per cycle n av for the fractal 
and EC models with N p = 4 using strip decomposition 
and two different processor sizes for different values of 
D/F. Also shown in Fig. ODare results for a larger than 
required ghost-region w g — 2 in order to test the depen- 
dence of the number of iterations on the range of interac- 
tion. As can be seen, the number of additional iterations 
is roughly linearly proportional to the average number 
of events per cycle. Interestingly, for the same average 
number of events per cycle n av , the number of additional 
iterations depends relatively weakly on the model, the 
processor height N y , and the value of D/F. However, 
doubling the width of the "ghost" region from w g = 1 to 
w g = 2 leads to an increase by a factor of approximately 
1.5 in the number of iterations. 

Figure [fj] shows the number of additional iterations I' 
as a function of the cycle length T for the fractal model 
with D/F = 10 5 , N p = 4, and N x = 256, N y = 1024. 
As can be seen, the number of additional iterations is 
roughly but not quite proportional to the cycle length 
T. Also shown in Fig. Elis the parallel efficiency, which 
has been directly measured from the execution time us- 
ing the definition given in Eq. 0] The maximum parallel 
efficiency occurs when T = T opt ~ 3.0 x 10 -6 and corre- 
sponds to n av ~ 30 KMC events per cycle. We have also 
calculated (not shown) the optimal cycle length T opt for 
the same processor size for other values of D/F ranging 
from 10 3 to 10 7 . While the optimal cycle length varies by 
approximately two orders of magnitude over this range 
of D/F, the average optimal number of events per cycle 



FIG. 6: Number of additional iterations and parallel effi- 
ciency for the fractal model as a function of the multiplication 
factor <j> with N p = 4, = 1 ML and T = <j>/D. 
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FIG. 7: Number of additional iterations as a function of the 
number of processors N p with 2 < N p < 64 for the fractal 
and edge diffusion models with D/F = 10 5 , T = 1/D and 
9 = 1 ML. In the EC model, r e =0.1 and r c = and the 
width of the ghost region w g — 1 unless specified. 



does not change much— n op t ~ 38 and 30 for D/F = 10 3 
and 10 7 respectively. 

Since the probability of an "extreme" number of 
boundary events in one of the processors increases with 
the number of processors for fixed processor size, the 
number of iterations increases with N p . As shown in 
Fig. [7| such an increase is well described by the loga- 
rithmic form, I — 1 = ao (In N p ) a where the exponent a 
ranges from 0.66 to 1.7 depending on the model and pro- 
cessor size. Fig. [7| also indicates that an increase of the 
interaction length from w g — 1 to w g — 2 also yields an 
approximate doubling in the number of iterations when 




• N 

X 

O N 

X 


= 256 N =lk 

y 

= N = 256 

V 


10 5 

D/F 


10 6 



10' 



FIG. 8: Relative fluctuation in number of events for the 
fractal model as a function of D/F with T = 1/ ' D, N p — 4 
and 6 = 1 ML. 



a fixed time interval T = 1/ D is used. Thus, in order to 
keep the number of iterations constant, the time interval 
must decrease as the range of interaction increases. 



E. Fluctuations in number of events 

A second important factor which determines the paral- 
lel efficiency is the existence of fluctuations in the number 
of events in different processors. In particular, since all 
processors are globally synchronized, the processor hav- 
ing the maximum number of events n max can determine 
the execution time of each iteration. Thus the extreme 
fluctuations An e /n av , as opposed to the usual r.m.s. fluc- 
tuations, determine the parallel efficiency. 

Figure [S] shows the measured fluctuations An e /n av for 
the simple fractal model as a function of D/F for fixed 
processor size N x = 256, N y — 1024 and N p = 4 averaged 
over many cycles. As expected the relative fluctuations in 
a smaller system are larger than those in a bigger system. 
For the simple fractal model, one expects that n av ~ 
Ni ~ (D/F)~ 2 / 3 which implies An e /n av ~ 1/y/n^ ~ 
(D/F) 1 / 3 . As can be seen, there is very good agreement 
with this form for the -D/F-dependence. 

Figure [§] shows the relative (extreme) fluctuations as 
a function of the number of processors N p for the fractal 
and EC models with two different processor sizes. As 
for the dependence of the number of iterations on N p , we 
find a logarithmic dependence. In particular, we find that 
An e /n av ~ (In Np) 1 with 7 = 2/3 regardless of model 
and processor size. Again, for a fixed N p , a bigger system 
shows smaller relative fluctuations than a smaller system. 
In addition, for the same processor size, the EC model 
shows smaller relative fluctuations than the fractal model 
due to the additional number of edge-diffusion events in 
the model. 

We now consider the dependence of the fluctuations 
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FIG. 9: Relative fluctuation in number of events for fractal 
and edge diffusion models as a function of number of proces- 
sors N p with D/F = 10 5 , 6 = 1 ML and T = 1/D, where 



W g = 1 in all cases. In the EC model, 
Dotted lines are linear fits to data. 
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FIG. 10: Power-law decay in the relative fluctuations for 
the fractal model as a function of multiplication factor cj> and 
T = 4>/D with N p = 4 and 6 = 1 ML. 



on the time interval T. For the fractal model, the av- 
erage number of events per cycle in each processor may 
be written, n av = N x N y (F + N X D)T where F is the de- 
position rate, D is the monomer hopping rate, and N± 
is the monomer density. The fluctuation in the number 
of events may be written as the sum of the fluctuation 
(proportional to nj v ) assuming all processors have the 
same average event rate, and an additional term due to 
fluctuations in the number of monomers in different pro- 



cessors, i.e. An e ~ n a 'v + N x N y 5Nf DT. We note that 
the fluctuation 5Nf = 7V{ nax - (JVi) also depends on the 
number of processors N p and the processor size N x N y . 



Dividing to obtain the (extreme) relative fluctuation we 
obtain, 
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(D/F) 5Nf 

1 + (D/F)(N 1 



[N x N y (F/D + (iVi))fl 



-1/2 



(7) 

where cf> = DT. The first term is independent of the cycle 
length T = <j>/D while for D/F » 1, (JV X ) >> F/£> and 
so the second term is simply proportional to -1 ' 2 . As 
can be seen in Fig. 1101 we find good agreement with this 
form for the fractal model with N p — 4, D/F = 10 5 , 
N x — 256, N y = 1024 and the time interval T ranging 
over more than two decades. 



F. Communication time and event optimization 

The third factor which determines the parallel effi- 
ciency is the communications overhead. For the case 
of strip decomposition, in every iteration each processor 
must carry out two local send/receive communications 
with its neighbors. Typically, a send/receive communi- 
cation with a small message size (< 100 bytes) between 
two processors in the same node takes less than 10 /zs 
but it can take longer if they are in different nodes. For 
a larger message size the communication overhead in- 
creases linearly with message size. Since the processor 
size in all of our simulations is moderate, the message 
size is only about 2000 bytes which takes roughly 30fis. 

A global communication (global sum or "AND" and 
broadcast) must also be carried out at the end of every 
iteration to check if a new iteration is necessary. The 
time for the global sum and broadcast is larger than for 
a local send/receive and increases logarithmically with 
the number of processors N p . Overall, we find that the 
estimated minimum total communication overhead per 
cycle is roughly 60/is for a small number of processors. 
In comparison, the serial calculation time t l KMC for one 
KMC event is about 5/zs on the Itanium cluster. Thus, 
even for a small number of processors the overhead due to 
communications (t com /ti p ) is significant unless n av >> 
12. 

One way to maximize the parallel efficiency is to use 
the event optimization method. In this method, the cycle 
length T is dynamically adjusted during the course of the 
simulation in order to achieve a fixed target number of 
events per cycle. By varying the target number of events 
and measuring the simulation time one may determine 
the optimal target number n opt . Figure ITU fa) shows the 
measured parallel efficiency for the fractal model with 
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4, N x = 256, N y = 1024, and D/F 



10 8 



function of the target number of events. As can be seen, 
for a target number given by n op t = 40 there is an opti- 
mal efficiency of approximately 41%. Also shown (dashed 
line) is the parallel efficiency calculated using the mea- 
sured additional number of events (A(r)) due to fluctua- 
tions and relaxation iterations along with the estimated 
communication time which may be approximated by the 
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FIG. 11: (a) Parallel efficiency and (b) measured additional 
number of events and estimated communication time per the 
average number of events as a function of target number of 



events with 6 = XML and N p — 4. 



fit icomAip — 16//n al) . The resulting calculated par- 
allel efficiency curve is close to the measured curve but 
has a slightly lower peak. Fig. ITlT b) shows separately 
the two contributions to the calculated parallel efficiency 
(A(r))/n av and t com /t\ p as a function of the target num- 
ber of events. As can be seen, the peak of the parallel 
efficiency is close to the point where these two contribu- 
tions have the same magnitude. 



G. Parallel efficiency as a function of D/F 

We now consider the parallel efficiency of the SR algo- 
rithm as a function of D/F for the fractal model for a 
fixed number of processors N p — 4. As shown in Fig. 1121 
when a fixed time interval T = 1/D is used, the par- 
allel efficiency (open symbols) shows a distinct peak as 
a function of D/F, with a maximum parallel efficiency 
PE ~ 0.3 for both processor sizes. The existence of such 
a peak may be explained as follows. For small D/F the 
PE is low due to the large number of events in each pro- 
cessor which leads to a large number of boundary events 
and relaxation iterations in each cycle. For large D/F the 
number of events per cycle is reduced but the communica- 
tions overhead and fluctuations become significant due to 
the small number of events. At an intermediate value of 
D/F which increases with increasing processor size, nei- 
ther of these effects dominate and the parallel efficiency 
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FIG. 12: Parallel efficiency for fractal model as function of 
D/F with T — 1/D (open symbols) and event optimization 
(E op t) method (filled symbols). Same symbol shape is used 
for the same processor size. Here N p = 4 and 8 = 1 ML in 
all cases. In the E op t method, n opt — 18 for N x = N y — 256 
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FIG. 13: Parallel efficiency for edge diffusion model as func- 
tion of D/F with T — 1/D (open symbols) and with event 
optimization method (filled symbols). Same symbol shape is 
used for the same processor size. Here, N p = 4, 9 = 1 ML, 
r e = 0.1 and r c = 0. In the E opt method, n op t = 23(30) for 
N X = N S = 256 (N x = 256 and iVy = lfc) for all D/F. Dashed 
line represents ideal parallel efficiency 1/ [1 + (A(r))/n «] us- 
ing n op t = 6 with N x — 256 and N y — Ik. 



is maximum. 

In contrast, when the event optimization method is 
used (filled symbols) the parallel efficiency is significantly 
higher and is almost independent of D/F for D/F < 10 6 . 
Although the value of the optimum target number of 
events n opt increases with processor size there is only a 
weak dependence on D/F for fixed processor size. Also 
shown in Fig. IT21 (dashed line) is the estimated ideal par- 
allel efficiency assuming negligible communications over- 
head. In this case, a small target number of events, 
n op t = 6 was found to yield the maximum ideal paral- 
lel efficiency over the range of D/F studied here. 

Similar results are shown in Fig. 1131 for the edge diffu- 
sion model. Since for the edge-diffusion model the "event 
density" is significantly higher than for the fractal model, 
the communications overhead and fluctuations are signif- 
icantly reduced. As a result, for the case of a fixed time 
interval T — 1/D, the maximum parallel efficiency is 
about 50% for the edge-diffusion model which is signifi- 
cantly higher than the peak value of 30% for the fractal 
case. When the event optimization method is used, the 
PE is also higher than for the fractal case and is again 
roughly independent of D/F. Also shown in Fig. 1131 for 
the larger processor size is the calculated ideal parallel 
efficiency assuming negligible communications overhead 
and a target number of events given by n opt — 6 (dashed 
line). Due to the reduced communications overhead for 
this model, the ideal PE is only slightly higher than the 
corresponding optimal PE with communications included 
(filled squares). 
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number of events n opt = 6. Solid line is a fit to the event- 
optimization data with a form, PE = 1/[1 + 0.81(ln N p ) ]. 



H. Parallel efficiency as a function of N p 

We first consider the dependence of the parallel effi- 
ciency on the number of processors N p with fixed proces- 
sor size. As before, the parallel efficiency is defined as the 
ratio of the execution time for an ordinary serial Simula- 
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tion of one processor's domain to the parallel execution 
time of N p domains using N p processors (Eq. |3J| . Fig. ITU 
shows the parallel efficiency for the fractal model with 
D/F = 10 5 as a function of the number of processors 
N p for fixed processor size. Results (open symbols) are 
shown for two different processor sizes for the case of fixed 
cycle length T = 1/D. Also shown (filled symbols) is 
the parallel efficiency obtained using event optimization 
for the larger processor size. While the parallel efficien- 
cies obtained using event optimization are significantly 
higher than the corresponding results obtained using a 
fixed time interval, the percentage difference decreases 
slightly as the number of processors increases. 
The solid line in Fig. El shows a fit of the form 



PE = 1/[1 + c (In Npf] 



(8) 



(see Eq. 0J to the parallel efficiency obtained for the 
larger processor size using event optimization. As can 
be seen, there is excellent agreement with the simulation 
results. The value of the exponent (j3 = 1.4) is in rea- 
sonable agreement with the dependence of the number of 
additional iterations on N p shown in Fig. [7| Also shown 
in Fig. E| is the ideal parallel efficiency in the absence of 
communication overhead calculated using a target num- 
ber of events given by n op t = 6. As expected, the ideal 
PE is significantly larger than the actual PE even for 
large N p . In this case a similar fit of the form of Eq. [S] 
may be made but with [3 ~ 1.1. 

Fig. E3 shows similar results for the edge-diffusion 
model with D/F = 10 5 , D e = 0.1D and D c = 0. For 
the larger processor size (N x = 256, N y — 1024) both 
the results obtained using a fixed cycle size and those 
using event optimization are very similar to the corre- 
sponding results already obtained for the fractal model. 
However, for a fixed cycle length the parallel efficien- 



cies for the smaller processor size (N x 



N, 



256) are 



somewhat higher than the corresponding results for the 
fractal model. Again, the ideal parallel efficiency is well 
described by a fit of the form of Eq. [H]with /3 ~ 1.1. In 
general, all the parallel efficiencies shown in Figs. 1 141 and 
ITSl are reasonably well described by fits of the form of 
Eq. with 0.66 </3 < 1.5. 

We now consider the dependence of the parallel effi- 
ciency on the number of processors for a fixed total sys- 
tem size L in the case of strip geometry (i.e., N y = L and 
N x = L/Np). Using L = 1024, D/F = 10 5 , E x = 0.1 eV, 
and a step-edge barrier Ef, = 0.07 eV at T = 300 K, we 
have carried out multilayer simulations of growth using 
the reversible model up to a coverage of 10 ML. In this 
case the parallel efficiency may be written as, 
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(9) 



where t\ p is the calculation time for a serial simulation 
of the Lx L system. We expect that in this case the par- 
allel efficiency will decrease more rapidly with increasing 
N p than for the case of fixed processor size, since the 
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FIG. 15: Parallel efficiency for edge diffusion model with 
D/F — 10 5 as a function of N p with T — 1/D (open symbols) 
and with event optimization method (n opt — 30). Dashed 
line represents ideal parallel efficiency obtained using a target 
number of events given by n op t — 6. Solid line is a fit to the 
data for E opt with a form, PE = 1/[1 + 0.54(ln Np) 1 - 5 ]. 



decreased processor size leads to increased fluctuations 
and communications overhead. Using the event optimiza- 
tion method, the parallel efficiencies obtained were 28% 
(N p = 4), 18% (N p = 8) and 9% (N p = 16), respectively, 
which corresponds to an approximate AT 1 dependence 
for the parallel efficiency. 



IV. DISCUSSION 

We have carried out parallel kinetic Monte Carlo 
(KMC) simulations of three different simple models of 
thin-film growth using the synchronous relaxation (SR) 
algorithm for parallel discrete-event simulation. In par- 
ticular, we have studied the dependence of the parallel ef- 
ficiency on the processor size, number of processors, and 
cycle length T, as well as the ratio D/F of the monomer 
hopping rate D to the (per site) deposition rate F. A va- 
riety of techniques for optimizing the parallel efficiency 
were also considered. As expected since the SR algo- 
rithm is rigorous, excellent agreement was found with 
serial simulations. 

Our results indicate that while reasonable parallel ef- 
ficiencies may be obtained for a small number of proces- 
sors, due to the requirement of global communications 
and the existence of fluctuations, the SR algorithm does 
not scale, i.e. the parallel efficiency decreases logarithmi- 
cally as the number of processors increases. In particular, 
for the fractal and edge-diffusion models with event op- 
timization, we have found that the dependence of the 
parallel efficiency as a function of N p may be fit to the 
form PE = [1 + c (InNp) 13 ]' 1 where (3 ~ 1.5. If the com- 
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munication time is negligible compared to the calculation 
time, then the parallel efficiency is higher but a similar 
fit is obtained with an exponent close to 1, i.e. /3 ~ 1.1. 
These results suggest that while the SR algorithm may 
be reasonably efficient for a moderate number of proces- 
sors, for a very large number of processors the parallel 
efficiency may be unacceptably low. These results are 
also in qualitative agreement with the analysis presented 
in Ref. [l8J that for parallel Ising spin simulations using 
the SR algorithm with a fixed cycle length, the parallel 
efficiency should decay as l/(logN p ) for large N p . 

We have also studied in detail the three main factors 
which determine the parallel efficiency in the SR algo- 
rithm. The first is the extra calculation overhead due to 
relaxation iterations which are needed to correct bound- 
ary events in neighboring processors. As expected, the 
number of relaxation iterations /' is proportional to the 
number of boundary events and is also roughly propor- 
tional to both the cycle length T and the range of in- 
teraction. As a result, decreasing the cycle length will 
decrease the overhead due to relaxation iterations. 

The second main factor determining the parallel ef- 
ficiency is the relative (extreme) fluctuation An e /n av in 
the number of events over all processors in each iteration. 
For a fixed number of processors the relative fluctuation 
decreases as one over the square-root of the number of 
events per cycle and is thus inversely proportional to the 
square-root of the product of the processor size and the 
cycle length. As a result, decreasing the cycle length T 
will increase the overhead due to fluctuations. However, 
for a fixed processor size and cycle length the relative fluc- 
tuation also increases logarithmically with the number of 
processors. This increase in the relative fluctuation also 
leads to an increase in the number of relaxation iterations 
with increasing N p as well as decreased processor utiliza- 
tion in each iteration. As a result the parallel efficiency 
decreases as the number of processors increases. 

The third factor determining the parallel efficiency is 
the overhead due to local and global communications. 
For the KMC models we have studied the calculation 
time per event is smaller than the latency time due to 
local communications. As a result, in our simulations 
the optimal parallel efficiency was obtained by using a 
cycle length such that n av >> 1 where n av is the aver- 
age number of events per processor per cycle. In general, 
the optimal value of n av may be determined by balanc- 
ing the overhead due to relaxation iterations and fluctu- 
ations with the overhead due to communications. Since 
the global communications time increases logarithmically 
with the number of processors, the communications over- 
head also leads to a decrease in the parallel efficiency with 
increasing N p . 

In order to optimize the parallel efficiency, we have 
considered and applied several techniques. These include 
(i) carrying out several simulations with a different fixed 
time interval T = 4>/ D in order to determine the opti- 
mum value of <j), (ii) using direct feedback to dynamically 
control the cycle length during a simulation in order to 



maximize the ratio of the average number of events per 
cycle n av to the measured or estimated execution time, 
and (iii) using feedback to dynamically control the cy- 
cle length in order to obtain a pre-determined "target 
number" for an auxiliary quantity such as the number 
of events per cycle or the number of iterations per cy- 
cle. While the first two methods are the most direct, we 
have found that in most cases, the third method results 
in the highest parallel efficiency. However, since there is 
no a priori way of knowing the optimal target number in 
advance, this optimization method must be accompanied 
by additional simulations. 

For the case of negligible communication time, corre- 
sponding to simulations in which the calculation time 
is much longer than the communication time, the cycle 
time should be small in order to minimize the number 
of additional iterations but not too small since a very 
small cycle time will lead to large relative fluctuations. 
For a processor size N x = 256, N y = 1024, we found that 
n opt ~ 6 leads to ideal parallel efficiencies which were sig- 
nificantly larger than obtained using event optimization 
with the communications time taken into account. 

We note that in our simulations we have focused pri- 
marily on the case of strip decomposition in order to 
minimize the communications overhead. However, if the 
calculation time is significantly larger than the commu- 
nications time, then for a square system the parallel effi- 
ciency may be somewhat larger if a square decomposition 
is used instead. To illustrate this we consider the decom- 
position of an L by L system into N p domains. If the 
width of the boundary region or range of interaction is 
given by w, then for the case of strip decomposition the 
area of the boundary region in each processor is given 
by Abdy = 2wL/N p . However ignoring corner effects, the 
area of the boundary region for the square decomposi- 
tion case is given by Abdy — AwL/ \/N p . For N p > 4, the 
area of the boundary region is smaller for square decom- 
position than for strip decomposition. Since the number 
of iterations is roughly proportional to the area of the 
boundary region, the calculation overhead due to relax- 
ation iterations will be larger for N p > 4 for the case of 
strip decomposition. As a result, we expect that for a 
fixed (square) system size and a large number of proces- 
sors, and for systems (unlike those studied here) with a 
high ratio of (per event) calculation time to communi- 
cations time, square decomposition may be significantly 
more efficient than strip decomposition. 

We also note that in our simulations we have used two 
slightly different definitions for the parallel efficiency. In 
the first definition (Eq. 0J, the parallel execution time 
was compared with the serial execution time of a system 
whose size is the same as a single processor. In contrast, 
in the second definition (Eq. [5| the parallel execution 
time was directly compared with 1/N P times the serial 
execution time of a system whose total system size is the 
same as in the parallel simulation. If the serial KMC 
calculation time per event is independent of system size, 
then there should be no difference between the two def- 
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initions. Since in the models studied here we have used 
lists for each type of event, we would expect the serial cal- 
culation time per event to be independent of system size, 
and thus the two definitions of parallel efficiency should 
be equivalent. To test if this is the case, we have calcu- 
lated the serial simulation time per event for the fractal 
model for D/F = 10 3 and D/F = 10 5 for a variety of 
system sizes ranging from L = 64 to L = 2048. Some- 
what surprisingly, we found that the serial calculation 
time per event increases slowly with increasing processor 
size. In particular, an increase of approximately 50% in 
the calculation time per event was obtained when going 
from a system of size L = 64 to L = 2048. We believe 
that this is most likely due to memory or "cache" effects 
in our simulations. This increase in the serial calculation 
time per event with increasing system size indicates that 
the calculated parallel efficiencies shown in Fig. O and 
Fig. El would actually be somewhat larger if the more 
direct definition of parallel efficiency (Eq. EJ were used. 
We now discuss some possible improvements of the 
method described here. As already noted, in our parallel 
KMC simulations, lists were used in order to maximize 
the serial efficiency. However, for simplicity each addi- 
tional iteration after the first iteration was restarted at 
the beginning of the cycle rather than starting with the 
first new boundary event in each processor. By using 
the more efficient method of only redoing events starting 
with the first new boundary event, we expect a possible 
maximum increase in the parallel efficiency of approx- 
imately 25% over the results presented here. In addi- 
tion, it is also possible that by using improved feedback 
methods, the parallel efficiency may be somewhat fur- 
ther increased. For example, by modifying the feedback 



algorithm it may be possible to further improve the di- 
rect optimization method. It may also be possible to 
combine all three optimization methods to obtain an im- 
proved parallel efficiency for a given simulation. 

Finally, we note that the main reason for the low paral- 
lel efficiency for a large number of processors is the global 
requirement that all processors must be perfectly syn- 
chronized. However, for systems with short-range inter- 
actions it should be possible to at least temporarily relax 
this synchronization requirement for processors which are 
sufficiently far away from one another. Thus, in large sys- 
tems with a large number of processors it may be possible 
to increase the parallel efficiency by slightly modifying 
the SR algorithm by making it somewhat less restrictive. 
In this connection, we have recently developed^ a semi- 
rigorous synchronous sublattice algorithm which yields 
excellent agreement with serial simulations for all but 
the smallest processor sizes and in which the asymptotic 
parallel efficiency is constant with increasing processor 
number. By combining the synchronous sublattice algo- 
rithm with the SR algorithm it may be possible to obtain 
a hybrid algorithm which contains the best features of 
both e.g. accuracy and efficiency. 
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